class: center, middle, inverse, title-slide .title[ # Automatic Sampling and Analysis of YouTube Data ] .subtitle[ ## Introduction ] .author[ ### Johannes Breuer, Annika Deubel, & M. Rohangis Mohseni ] .date[ ### February 14th, 2023 ] --- layout: true --- ## Goals of this course After this course you should be able to... - automatically collect *YouTube* data - process/clean it - do some basic (exploratory) analyses of user comments --- ## About us **Johannes Breuer** .small[ - Senior researcher in the team *Data Augmentation*, department [*Survey Data Curation*](https://www.gesis.org/en/institute/departments/survey-data-curation) at *GESIS* - digital trace data for social science research - data linking (surveys + digital trace data) - (Co-)leader of the team *Research Data & Methods* at the [*Center for Advanced Internet Studies* (CAIS)](https://www.cais-research.de/) - Ph.D. in Psychology, University of Cologne - Research interests - Use and effects of digital media - Computational methods - Data management - Open science [johannes.breuer@gesis.org](mailto:johannes.breuer@gesis.org), [@MattEagle09](https://twitter.com/MattEagle09), [personal website](https://www.johannesbreuer.com/) ] --- ## About us **Annika Deubel** .small[ - M.Sc. in Applied Cognitive and Media Sciences (University of Duisburg-Essen) - Ph.D. candidate at the University of Duisburg-Essen - Researcher in the team *Research Data and Methods* at the *Center for Advanced Internet Studies* (CAIS) - Main area: health communication and information on social media - Other research interests: - Data and Algorithm Literacy - Computational methods [annika.deubel@cais-research.de](mailto:annika.deubel@cais-research.de), [@anndeub](https://twitter.com/anndeub) ] --- ## About us **M. Rohangis Mohseni** - Postdoctoral researcher (Media Psychology) at TU Ilmenau - Ph.D. 
in Psychology, University Osnabrueck - Ongoing habilitation on "sexist online hate speech" 👿 - Other research interests - Electronic media effects - Moral behavior [rohangis.mohseni@tu-ilmenau.de](mailto:rohangis.mohseni@tu-ilmenau.de), [@romohseni](https://twitter.com/romohseni) --- ## About you - What's your name? - Where do you work? - What is your experience with `R`? - Why/how do you want to use *YouTube* for your research? --- ## Prerequisites for this course - Working version of `R` >= 4.0.0 and a recent version of [RStudio](https://rstudio.com/products/rstudio/download/#download) - Some basic knowledge of `R` - Interest in working with *YouTube* data --- ## Workshop Structure & Materials - The workshop consists of a combination of lectures and hands-on exercises - Slides and other materials are available at [https://github.com/jobreu/youtube-workshop-gesis-2023](https://github.com/jobreu/youtube-workshop-gesis-2023) We also put the PDF versions of the slides and some other materials on the [GESIS Ilias](https://ilias.gesis.org/) repository for this course. --- ## Online format - If possible, we invite you to turn on your camera - Feel free to ask questions anytime - If you have an immediate question during the lecture parts, please send it via text chat, publicly or privately (ideally to a person who is currently not presenting) - If you have a question that is not urgent and might be interesting for everybody, you can also use audio (& video) to ask it at the end of a lecture part or during the exercises (please use the "raise hand" function in *Zoom* for this) - We would kindly ask you to mute your microphones when you are not asking (or answering) a question --- ## Preliminaries: Base `R` vs. `tidyverse` In this course, we will use a mixture of base `R` and `tidyverse` code as Johannes prefers the `tidyverse`, and Annika and Ro use both. 
.small[ ICYC, here are some opinions [for](http://varianceexplained.org/r/teach-tidyverse/) and [against](https://blog.ephorie.de/why-i-dont-use-the-tidyverse) using/teaching the `tidyverse`. ] --- ## The `tidyverse` If you've never seen `tidyverse` code, the most important thing to know is the `%>%` [(pipe) operator](https://magrittr.tidyverse.org/reference/pipe.html). Put briefly, the pipe operator takes an object (which can be the result of a previous function) and pipes it (by default) as the first argument into the next function. This means that `function(arg1 = x)` is equivalent to `x %>% function()`. It may also be worthwhile to know/remember that `tidyverse` functions normally produce [`tibbles`](https://tibble.tidyverse.org/) which are a special type of dataframe (and most `tidyverse` functions also expect dataframes/tibbles as input to their first argument). --- ## The `tidyverse` If you want a short primer (or need a quick refresher) on the `tidyverse`, you can check out the blog [post by Dominic Royé](https://dominicroye.github.io/en/2020/a-very-short-introduction-to-tidyverse/). For a more in-depth exploration of the `tidyverse`, you can, e.g., have a look at the [workshop by Olivier Gimenez](https://oliviergimenez.github.io/intro_tidyverse/#1). And the book [*R for Data Science*](https://r4ds.had.co.nz/) by Hadley Wickham and Garrett Grolemund (which is available for free online) provides a very comprehensive introduction to the `tidyverse`. --- ## Preliminaries: What's in a name? Another thing you might notice when looking at our code is that we love 🐍 as much as 🐫. 
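
As a minimal sketch of both points (mixed naming styles and the pipe equivalence described on the `tidyverse` slide), the following example assumes that the `magrittr` package, which provides `%>%` and is installed with the `tidyverse`, is available. The object names and numbers are made up for illustration and do not come from the workshop materials.

```r
# Assumption: magrittr is installed (it comes with the tidyverse)
library(magrittr)

# snake_case name; hypothetical comment lengths in characters
comment_lengths <- c(12, 48, 7, 103)

# Nested base R call: the innermost function runs first
mean_nested <- round(mean(comment_lengths), 1)

# camelCase name; the same computation written with the pipe,
# where each result is passed on as the first argument of the next function
meanPiped <- comment_lengths %>%
  mean() %>%
  round(1)

# Both versions yield the same value
identical(mean_nested, meanPiped)
```

Reading the piped version from top to bottom follows the order in which the steps are carried out, which is the main reason many people find it easier to read than nested calls.
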
<img src="https://bookdown.org/joone/ComputationalMethods/img/horst/coding_cases.png" width="50%" style="display: block; margin: auto;" /> .center[ <small><small>Artwork by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)</small></small> ] --- ## Course schedule .center[**Tuesday, February 14th, 2023**] <table> <thead> <tr> <th style="text-align:center;"> When? </th> <th style="text-align:center;"> What? </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> 09:00 - 10:00 </td> <td style="text-align:center;"> Introduction </td> </tr> <tr> <td style="text-align:center;"> 10:00 - 11:00 </td> <td style="text-align:center;"> The YouTube API </td> </tr> <tr> <td style="text-align:center;"> 11:00 - 11:15 </td> <td style="text-align:center;"> <i>Coffee Break</i> </td> </tr> <tr> <td style="text-align:center;"> 11:15 - 12:15 </td> <td style="text-align:center;"> Tools for collecting YouTube data </td> </tr> <tr> <td style="text-align:center;"> 12:15 - 13:15 </td> <td style="text-align:center;"> <i>Lunch Break</i> </td> </tr> <tr> <td style="text-align:center;"> 13:15 - 14:45 </td> <td style="text-align:center;"> Collecting YouTube data with R </td> </tr> <tr> <td style="text-align:center;"> 14:45 - 15:00 </td> <td style="text-align:center;"> <i>Coffee Break</i> </td> </tr> <tr> <td style="text-align:center;"> 15:00 - 16:30 </td> <td style="text-align:center;"> Processing and cleaning user comments </td> </tr> </tbody> </table> --- ## Course schedule .center[**Wednesday, February 15th, 2023**] <table> <thead> <tr> <th style="text-align:center;"> When? </th> <th style="text-align:center;"> What? 
</th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> 09:00 - 10:30 </td> <td style="text-align:center;"> Basic text analysis of user comments </td> </tr> <tr> <td style="text-align:center;"> 10:30 - 10:45 </td> <td style="text-align:center;"> <i>Coffee Break</i> </td> </tr> <tr> <td style="text-align:center;"> 10:45 - 12:15 </td> <td style="text-align:center;"> Sentiment analysis of user comments </td> </tr> <tr> <td style="text-align:center;"> 12:15 - 13:15 </td> <td style="text-align:center;"> <i>Lunch Break</i> </td> </tr> <tr> <td style="text-align:center;"> 13:15 - 14:45 </td> <td style="text-align:center;"> Excursus: Retrieving video subtitles </td> </tr> <tr> <td style="text-align:center;"> 14:45 - 15:00 </td> <td style="text-align:center;"> <i>Coffee Break</i> </td> </tr> <tr> <td style="text-align:center;"> 15:00 - 16:30 </td> <td style="text-align:center;"> Practice session, questions, and outlook </td> </tr> </tbody> </table> --- ## Why is *YouTube* relevant? - Important online video platform<br /><small>([Alexa Traffic Ranks, 2019](https://www.alexa.com/topsites); [Konijn, Veldhuis, & Plaisier, 2013](https://doi.org/10.1089/cyber.2012.0357))</small> - Esp. popular among adolescents who use it to, e.g., watch movies & shows, listen to music, and retrieve information<br /><small>([Feierabend, Plankenhorn, & Rathgeb, 2016](https://www.mpfs.de/studien/kim-studie/2016/))</small> - For adolescents, *YouTube* partly replaces TV<br /><small>([Defy Media, 2017](http://www.newsroom-publicismedia.fr/wp-content/uploads/2017/06/Defi-media-acumen-Youth_Video_Diet-mai-2017.pdf))</small> - YouTubers can be social media stars<br /><small>([Budzinski & Gaenssle, 2018](https://doi.org/10.1080/08997764.2020.1849228))</small> --- ## Why is *YouTube* data interesting for research? 
- Content producers and users generate huge amounts of data - These data can be useful for research on media content, communicators, and user interaction - The data are publicly available and relatively easy to retrieve via the *YouTube* API - For some further reasons and examples, see [Arthurs et al., 2019](https://doi.org/10.1177/1354856517737222); [Baertl, 2018](https://doi.org/10.1177/1354856517736979) --- ## Research Examples - Audience - Usage of YouTube<br /><small>([Defy Media, 2017](http://www.newsroom-publicismedia.fr/wp-content/uploads/2017/06/Defi-media-acumen-Youth_Video_Diet-mai-2017.pdf))</small> - Experiences with YouTube<br /><small>([Defy Media, 2017](http://www.newsroom-publicismedia.fr/wp-content/uploads/2017/06/Defi-media-acumen-Youth_Video_Diet-mai-2017.pdf); [Lange, 2007](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.170.3808&rep=rep1&type=pdf); [Moor et al., 2010](https://doi.org/10.1016/j.chb.2010.05.023); [Oksanen et al., 2014](https://doi.org/10.1108/S1537-466120140000018021); [Szostak, 2013](https://journals.mcmaster.ca/mjc/article/view/280); [Yang et al., 2010](https://doi.org/10.1089/cyber.2009.0105))</small> - Video consumption<br /><small>([Montes-Vozmediano et al., 2018](https://doi.org/10.3916/C54-2018-06); [Tucker-McLaughlin, 2013](https://scholar.google.de/scholar?cluster=6669321353953252964))</small> - Radicalization<br /><small>([Albadi et al., 2022](https://arxiv.org/abs/2207.00111); [Ribeiro et al., 2020](https://doi.org/10.1145/3351095.3372879))</small> - Community formation<br /><small>([Kaiser & Rauchfleisch, 2020](https://doi.org/10.1177/2056305120969914))</small> --- ## Research Examples - Content - Incivility / Hate Speech in comments<br /><small>([Döring & Mohseni, 2019a](https://doi.org/10.1080/14680777.2018.1467945), [2019b](https://doi.org/10.1080/08824096.2019.1634533), [2020](https://doi.org/10.5771/2192-4007-2020-1-62); [Obadimu et al., 2019](https://doi.org/10.1007/978-3-030-21741-9_22); [Spörlein & 
Schlueter, 2021](https://doi.org/10.1093/esr/jcaa053); [Wotanis & McMillan, 2014](https://doi.org/10.1080/14680777.2014.882373))</small> - Commenter attributes<br /><small>([Literat & Kligler-Vilenchik, 2021](https://doi.org/10.1177/20563051211008821); [Röchert et al., 2020](https://doi.org/10.5117/CCR2020.1.004.ROCH); [Thelwall & Mas-Bleda, 2018](https://doi.org/10.1108/AJIM-09-2017-0204))</small> - Comment characteristics<br /><small>([Thelwall, 2018](https://doi.org/10.1080/13645579.2017.1381821); [Thelwall et al., 2012](https://doi.org/10.1002/asi.21679))</small> - Video content<br /><small>([Kohler & Dietrich, 2021](https://doi.org/10.3389/fcomm.2021.581302); [Utz & Wolfers, 2020](https://doi.org/10.1080/1369118X.2020.1804984))</small> --- ## Research Examples - Communicator - Video production<br /><small>([Utz & Wolfers, 2020](https://doi.org/10.1080/1369118X.2020.1804984))</small> - Extremism / Ideology<br /><small>([Rauchfleisch & Kaiser, 2020](https://doi.org/10.1080/08838151.2020.1799690), [2021](https://doi.org/10.2139/ssrn.3867818); [Dinkov et al., 2019](https://arxiv.org/abs/1910.08948); [Ribeiro et al., 2020](https://doi.org/10.1145/3351095.3372879))</small> - Gender / Diversity<br /><small>([Chen et al., 2021](https://doi.org/10.1177/14614448211034846); [Wegener et al., 2020](https://doi.org/10.5204/mcj.2728); [Thelwall & Mas-Bleda, 2018](https://doi.org/10.1108/AJIM-09-2017-0204))</small> - Economic aspects<br /><small>([Budzinski & Gaenssle, 2018](https://doi.org/10.1080/08997764.2020.1849228))</small> - Channel hierarchy / Ranking<br /><small>([Rieder et al., 2018](https://doi.org/10.1177/1354856517736982); [Rieder et al., 2020](https://doi.org/10.5210/fm.v25i8.10667))</small> --- ## How to collect *YouTube* data There are many different ways in which data from *YouTube* and other social media can be collected (see [Breuer et al., 2020](https://journals.sagepub.com/doi/10.1177/1461444820924622)): - Manually (e.g., via copy & paste and manual 
content analysis) - Using existing data, such as [*YouNiverse: Large-Scale Channel and Video Metadata from English YouTube*](https://zenodo.org/record/4650046) (also see the accompanying preprint by [Ribeiro & West, 2021](https://arxiv.org/abs/2012.10378)) - Automatically via the *YouTube* API or web scraping --- ## Comparisons of Approaches for Collecting *YouTube* Data .small[ <table> <thead> <tr> <th style="text-align:center;"> Software </th> <th style="text-align:center;"> Type </th> <th style="text-align:center;"> Can collect </th> <th style="text-align:center;"> Comment Scope </th> <th style="text-align:center;"> Needs API Key </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> YouTube Data Tools 1.22 </td> <td style="text-align:center;"> Website </td> <td style="text-align:center;"> Channel Info, Video Info, Comments </td> <td style="text-align:center;"> Only all </td> <td style="text-align:center;"> No </td> </tr> <tr> <td style="text-align:center;"> Webometric 4.1 </td> <td style="text-align:center;"> Standalone app </td> <td style="text-align:center;"> Channel Info, Video Info, Comments, Video Search </td> <td style="text-align:center;"> 100 most recent or all </td> <td style="text-align:center;"> Yes </td> </tr> <tr> <td style="text-align:center;"> Tuber 0.9.9 </td> <td style="text-align:center;"> R package </td> <td style="text-align:center;"> Channel Info, Video Info, Comments, Subtitles, All searches </td> <td style="text-align:center;"> 20-100 most recent or all </td> <td style="text-align:center;"> Yes </td> </tr> <tr> <td style="text-align:center;"> vosonSML 0.29.13 </td> <td style="text-align:center;"> R package </td> <td style="text-align:center;"> Video IDs, Comments </td> <td style="text-align:center;"> 1-x top-level </td> <td style="text-align:center;"> Yes </td> </tr> <tr> <td style="text-align:center;"> youtubecaption 1.0.0 </td> <td style="text-align:center;"> R package </td> <td style="text-align:center;"> Subtitles </td> <td 
style="text-align:center;"> n/a </td> <td style="text-align:center;"> No </td> </tr> </tbody> </table> ] --- ## Exemplary Comparison of the Different Tools .small[ <table> <thead> <tr> <th style="text-align:center;"> Software </th> <th style="text-align:center;"> Ease of Use </th> <th style="text-align:center;"> Disadvantages </th> <th style="text-align:center;"> No. of Comments </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> YouTube Data Tools 1.22 </td> <td style="text-align:center;"> High </td> <td style="text-align:center;"> Lacking flexibility, less information </td> <td style="text-align:center;"> 52,243 </td> </tr> <tr> <td style="text-align:center;"> Webometric 4.1 </td> <td style="text-align:center;"> Low </td> <td style="text-align:center;"> Only first 5 follow-up comments, no error feedback, undetectable time-outs </td> <td style="text-align:center;"> 49,150 </td> </tr> <tr> <td style="text-align:center;"> Tuber 0.9.9 </td> <td style="text-align:center;"> Low </td> <td style="text-align:center;"> Only first 5 follow-up comments </td> <td style="text-align:center;"> 49,139 </td> </tr> <tr> <td style="text-align:center;"> vosonSML 0.29.13 </td> <td style="text-align:center;"> Low </td> <td style="text-align:center;"> Lacking flexibility, only comments </td> <td style="text-align:center;"> 50,619 </td> </tr> </tbody> </table> ] Example data source: [Dayum Video](https://www.youtube.com/watch?v=DcJFdCmN98s) --- ## A note on using FOSS The tools listed before are free and open source software (FOSS). Using FOSS has many advantages (availability, adaptability, etc.). However, one risk associated with using FOSS is that tools are not maintained anymore and, hence, cease to function. After all, people create and maintain these tools in their spare time or as side projects and this work is often not recognized enough (esp. within academia). 
For this reason, it is especially important to acknowledge the work that goes into these tools by properly citing them.

.small[
```r
citation("tuber")
```

```
## 
## To cite package 'tuber' in publications use:
## 
##   Gaurav Sood (2020). tuber: Access YouTube from R. R package version 0.9.9.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {tuber: Access YouTube from R},
##     author = {Gaurav Sood},
##     year = {2020},
##     note = {R package version 0.9.9},
##   }
```
]

---
class: center, middle

# Any questions so far?